In previous labs, we’ve explore supervised machine learning, where we have a target: a variable (y) that we want to predict as a function of a set of features (X). In this lab, we’ll explore unsupervised learning, where there is no distinct target. These methods are generally used to explore complex data by reducing it in some way; either by reducing the number of variables or dimensions, or by grouping observations into clusters. Here, we’ll look at three clustering methods:
\(k\)-means clustering
Hierarchical clustering
Self-organizing maps
The data we will use is from the Gap Minder project. While you can download the individual variables from the web site, we will use a preprocessed set of the data covering the period 1801-2018, in the file gapminder_1800_2018.csv. Download this to your datafiles folder and make a new folder for today’s class called lab06. You will also need the shapefile of country borders: ne_50m_admin_0_countries.shp (you should have this from a previous lab).
Objectives
Set up data to run different unsupervised classification methods
Understand and work with the output
Use self-organizing maps with geographical data
Set up
Start RStudio and set the working directory to this directory (This can be changed by going to the [Session] menu and selecting [Set working directory] -> [Choose directory…], then browsing to the folder).
You will need the following packages for today’s lab, so make sure to install anything that is missing before proceeding.
tidyverse
ggpubr
RColorBrewer
countrycode
sf
kohonen
tmap
Once you have set everything up, we can start by reading in the data.
country year child_mortality fertility per_cap_co2 income life_expectancy
1 Afghanistan 1800 469 7 NA 603 28.2
2 Afghanistan 1801 469 7 NA 603 28.2
3 Afghanistan 1802 469 7 NA 603 28.2
4 Afghanistan 1803 469 7 NA 603 28.2
5 Afghanistan 1804 469 7 NA 603 28.2
6 Afghanistan 1805 469 7 NA 603 28.2
population
1 3280000
2 3280000
3 3280000
4 3280000
5 3280000
6 3280000
Now load the following directories to process and visualize the data:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggpubr)library(RColorBrewer)
We can now make some plots to look at the data. First, histograms of the three variables in the data (child mortality, fertility, per capita CO2 emissions, GDP per capita, life expectancy and population size):
The histograms clearly show skew in most of the variables, and we will need to transform the data before analyzing it. We’ll do this in two steps. First, we’ll log transform all variables to remove the skew, then use the scale() function, which in R scales data to \(z\)-scores by subtracting the mean and dividing by the standard deviation. To simplify things, we do all this in a single step by nesting the log() function in the scale() function.
Note that we remove rows with missing observations (we could potentially impute these, but that is a little more difficult for time series data). We also remove any row where the per capita CO2 emissions are zero, partly because this would cause problems when log-transforming and partly because this is clearly not true.
We’ll next use these data with the \(k\)-means cluster algorithm. The default function for this in R is kmeans(), and we need to pass the data we are going to use, and the number of requested clusters to this function. First let’s figure out which columns we need:
There output of this algorithm provides information about the cluster solution. We can see the size of each cluster (the number of observations assigned to each cluster)
gap_kmeans$size
[1] 42 43 26 16 35 22
And we can see the cluster assignments for each country:
And finally the cluster centers. These are the prototypes; the average value of each variable for all the observations assigned to a given cluster (note these are the log-transformed and scaled data)
Here, we can see that cluster 6, for example, has the lowest life expectancy and GDP, as well as high infant mortality and fertility rates.
Maps
As these are spatial data, we can now plot the distribution of the clusters. First, we’ll load the world shapefile in the ne_50m_admin_0_countries folder (we used this in a previous lab:)
library(sf)
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
Now we need to merge the gap_df_scale2 data frame with the world borders spatial object. To do this, we add a new column to the data frame containing the ISO3 standard country codes, using the countrycode package (you’ll need to install this).
Finally, we use these ISO codes to merge the two datasets. We need to specify the column name for each dataset that contains the label to be used in merging, which we do with by.x and by.y.
Finally, we can plot out the clusters using the tmap package, which highlights the position of this poorer cluster 6 in Saharan and sub-Saharan Africa:
[v3->v4] `tm_tm_fill()`: migrate the argument(s) related to the scale of the
visual variable `fill` namely 'palette' (rename to 'values') to fill.scale =
tm_scale(<HERE>).
[v3->v4] `tm_legend()`: use 'tm_legend()' inside a layer function, e.g.
'tm_polygons(..., fill.legend = tm_legend())'
[tip] Consider a suitable map projection, e.g. by adding `+ tm_crs("auto")`.
[cols4all] color palettes: use palettes from the R package cols4all. Run
`cols4all::c4a_gui()` to explore them. The old palette name "Set2" is named
"brewer.set2"
Multiple palettes called "set2" found: "brewer.set2", "hcl.set2". The first one, "brewer.set2", is returned.
Hierarchical cluster analysis
We’ll briefly demonstrate the use of hierarchical clustering here. This works a little differently in R, as it requires you to first calculate a dissimilarity matrix (the aggregate difference between each pair of observations), and then use this for clustering.
The R function dist() will calculate this, based on simple Euclidean dissimiarity between samples:
gap_d <-dist(gap_df_scale2[,3:8])
We can now use this to carry out hierarchical clustering using the hclust() function:
gap_hclust <-hclust(gap_d)
And you can then plot the resulting dendrogram with:
plot(gap_hclust)
To actually get a cluster assigment for each observation, we need to ‘cut’ the tree in a way that it will result in the desired number of groups, using the well-named cutree() function, To get 6 clusters:
To see what this has done, you can overlay the clusters on the dendogram as follows:
plot(gap_hclust)rect.hclust(gap_hclust, k =6)
The clusters that you obtain from the cutree() function can be used in the same way that we used the \(k\)-means clusters. Here, we’ll append them to the spatial object (cluster_sf) and map them out:
[v3->v4] `tm_tm_fill()`: migrate the argument(s) related to the scale of the
visual variable `fill` namely 'palette' (rename to 'values') to fill.scale =
tm_scale(<HERE>).
[cols4all] color palettes: use palettes from the R package cols4all. Run
`cols4all::c4a_gui()` to explore them. The old palette name "Set2" is named
"brewer.set2"
Multiple palettes called "set2" found: "brewer.set2", "hcl.set2". The first one, "brewer.set2", is returned.
Self-organizing maps
We’ll now repeat this analysis with a self-organizing map (SOM), but using the full dataset. For this we’ll use the kohonen package, which allows the construction of unsupervised and supervised SOMs.
library(kohonen)
Attaching package: 'kohonen'
The following object is masked from 'package:purrr':
map
Building the SOM is relatively simple: first we construct the grid of nodes (30x20) using the somgrid() function. Note the topo argument that controls the neighborhood shape (rectangular or hexagonal):
You can plot the empty grid to see the layout with (plot(som_grid)).
Now we’ll train the SOM. The function we use is som(), and we need to specify the data to be used as well as the SOM grid. The function requires the data to be as a matrix rather than an R data frame, so we first convert this, then pass it to the function:
Once finished, we need to see if the algorithm has converged. We can do this by plotting the change in the loss function as a function of the iterations. As a reminder, the loss function here is the mean distance between the node weights and the observations assigned to that node:
plot(gap_som, type ="changes")
We can see here that the loss function has decreased, but not stabilized suggesting that the algorithm has converged. We’ll re-run it for a larger number of iterations by setting rlen to 1000 (this may take a couple of minutes to run):
gap_som =som(gap_mat, som_grid, rlen =1000)plot(gap_som, type ="changes")
This plot shows a series of step like drops in the loss function, finally stabilizing at approximately iteration 900-1000.
Plots
Next we’ll go through some visualizations of the data. We’ve already see then loss function changes, so next we’ll make a couple of plots to show how the SOM has trained. First a plot of distances between nodes, which can be useful to identify outliers:
plot(gap_som, type ="dist")
Next, we plot the number of observations per cluster to look for empty nodes.
plot(gap_som, type ="counts")
THe map has a couple of empty nodes. Empty nodes can be common when using larger grid sizes, and can be taken to represent a part of the parameter space that is missing in the observations.
Next, we’ll plot the codebook vectors. These are the weights associated with each feature, and can be used to interpret the organization of the SOM. For each node, there is a radial plot showing the value of each feature associated with it. Note that the type of plot can change here. If you have a large number of features, this will plot as a line plot rather than a radial plot.
plot(gap_som, type ="code")
The map shows a clear gradient from richer regions (income, orange) on the top left to poorer on the right, and a gradient from smaller populations at the top to smaller at the bottom. Note that the life expectancy follows the gradient of GDP fairly well, and the poorer regions tend to have higher infant mortality and fertility rates.
It is also possible to plot out the individual feature weights as a heatmap to illustrate the gradient using type = "property". The codebook vectors of weights are contained within the gap_som object. The syntax to extract them is a little complicated, and some examples are given below
Population
plot(gap_som, type ="property", property = gap_som$codes[[1]][,"population"],main ="Pop")
Life expectancy
plot(gap_som, type ="property", property = gap_som$codes[[1]][,"life_expectancy"],main ="Life Exp.")
Income (per capita GDP)
plot(gap_som, type ="property", property = gap_som$codes[[1]][,"income"],main ="Income")
Per capita CO2 emissions
plot(gap_som, type ="property", property = gap_som$codes[[1]][,"per_cap_co2"],main ="Per capita CO2")
Cluster analysis
As a next step, we’ll cluster the nodes of the map to try and identify homogenous regions to simplify the interpretation of the SOM. While \(k\)-means is often used here, we’ll use a hierarchical algorithm (for reasons that will be clear shortly) to form 6 clusters. Again, we first calculate the dissimilarity matrix, but this time between the values (codes) of each of the SOM nodes. Then we use the cutree() and hclust() function to form the clusters
We can now show this on the SOM grid using the plot() function, setting type to mapping and using the cluster values in som_cluster to set the colors. The cluster boundaries can also be overlaid with add.cluster.boundaries()
# Define a color palettemypal =brewer.pal(6, "Set2")plot(gap_som, type="mapping", bgcol = mypal[som_cluster], main ="Clusters", pch ='.') add.cluster.boundaries(gap_som, som_cluster)
The main problem with this is that the clusters are not-contiguous on the SOM grid. This reflects the non-linear mapping that a SOM carries out (and may also suggest that 6 clusters is too high). We can force the clusters to be contiguous by including the distance between the nodes on the SOM grid as well as the dissimilarity between codebook values. We’ve already calulated this latter value above (as codes_dist), so let’s get the distance between the nodes using the handy helper function unit.distances():
dist_grid <-unit.distances(som_grid)
We then multiply these two distance/dissimilarity matrices together:
dist_adj <-as.matrix(dist_codes) * dist_grid
If we now repeat the clustering, the inclusion of the distances between nodes now forces these clusters to be contguous:
Next, we obtain aggregate values for each cluster. While we can get these values from the gap_som object, we can also work back to get values from the original data (in gap_df). In the following code we:
Extract a vector with the node assignments for each country
Use this to get the cluster number for the node that the country is assigned to
Attach this to the original gap2_df dataframe as a factor (rather than an integer)
To get the values for each cluster on the original, non-transformed scale, we need to merge gap_df_scale which contains the cluster assignments with the original data (gap_df)
gap_df_clus <-merge(gap_df, gap_df_scale, by =c("country", "year"))
Now we can use this list of clusters by country to aggregate the variables:
cluster.vals <-aggregate(gap_df_clus[,3:8], by =list(gap_df_scale$somclust), mean)
Which gives the following table:
Group.1
child_mortality.x
fertility.x
per_cap_co2.x
income.x
life_expectancy.x
population.x
1
251.48969
5.341362
1.1584654
2768.943
45.94086
63467467.8
2
142.16838
4.859589
2.0997180
5594.270
57.60789
8911883.7
3
27.25484
2.159607
7.7293525
20151.552
72.78047
57435306.8
4
299.04001
6.141442
0.2062727
1630.047
39.15094
6395727.5
5
40.27428
3.219797
6.4329921
16656.107
69.25189
2638169.5
6
164.66853
6.253948
2.1849984
4590.799
54.46803
548892.6
This shows the approximate characteristics for the clusters:
High life expectancy and GDP, low mortality and fertility rates, medium population
Low life expectancy and GDP, very high mortality and fertility rates, medium population
Low life expectancy and GDP, high mortality and fertility rates, small population
Medium life expectancy, low GDP, medium mortality and fertility rates, large population
High life expectancy and GDP, lower mortality and fertility rates, large population
High life expectancy and medium GDP, medium mortality and fertility rates, small population
As we have data over multiple years, we can also use the results of the SOM analysis to track the change made by an individual country over time. To do this, we need to know which node a country was assigned to for each of it’s time steps. To do this we
Find all the observation/row numbers belonging to a country (e.g. here for the United States)
Find the node assigned to each of those observations
Find the node coordinates (on the grid) for each of those nodes
Now we plot out the grid again, shaded with the cluster assignments. Then we use R’s line() function to overlay the country’s trajectory. In the last couple lines, we make up a vector of years to label the trajectory. As there will be a lot of these, we only keep the label for 20 year increments. Then we add these to the plot
And finally, we can plot out the spatial distribution of these clusters for any given year in the dataset. First we need to assign the country codes to link to the shapefile:
[v3->v4] `tm_tm_fill()`: migrate the argument(s) related to the scale of the
visual variable `fill` namely 'palette' (rename to 'values') to fill.scale =
tm_scale(<HERE>).
[v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
[cols4all] color palettes: use palettes from the R package cols4all. Run
`cols4all::c4a_gui()` to explore them. The old palette name "Set2" is named
"brewer.set2"
Multiple palettes called "set2" found: "brewer.set2", "hcl.set2". The first one, "brewer.set2", is returned.
Exercise
For the exercise, you will need to build use an unsupervised classification method with the Cancer dataset from the lab 4 (data_atlantic_1998_2012.csv). As a reminder, this file contains information on cancer rates from around 660 counties in the eastern part of the US. There is also a shapefile with the polygons for each county (COUNTY_ATLANTIC.shp).
For the exercise, you will need to cluster these data and map out the resulting clusters. You will need to carry out the following steps:
Read in the data and select the columns: Cancer, Smoking, Poverty, PM25, SO2 and NO2
Scale the data using a \(z\)-score transform
Cluster the data using either\(k\)-means or a self-organizing map
Extract the cluster number for each county and add it to the shapefile
Make a map of the clusters
Use a Quarto document to record your answers and output. Assignments, to include both the Quarto document and (ideally) the compiled HTML file, should be submitted to Canvas. Please use the following naming convention: Lab06_lastname.
Appendix
Gap Minder Dataset
GapMinder dataset on global inequality (1800 to 2018): gapminder_1800_2018.csv